Abstract

We present a framework for learning in hidden Markov models with distributed state
representations. Within this framework, we derive a learning algorithm based on the
Expectation-Maximization (EM) procedure for maximum likelihood estimation. Analogous to the
standard Baum-Welch update rules, the M-step of our algorithm is exact and can be solved
analytically. However, due to the combinatorial nature of the hidden state representation, the
exact E-step is intractable. A simple and tractable mean field approximation is derived.
Empirical results on a set of problems suggest that both the mean field approximation and Gibbs
sampling are viable alternatives to the computationally expensive exact algorithm.
1 Introduction

A problem of fundamental interest to machine learning is time series modeling. Due to the
simplicity and efficiency of its parameter estimation algorithm, the hidden Markov model (HMM)
has emerged as one of the basic statistical tools for modeling discrete time series, finding
widespread application in the areas of speech recognition (Rabiner and Juang, 1986) and
computational molecular biology (Baldi et al., 1994). An HMM is essentially a mixture model,
encoding information about the history of a time series in the value of a single multinomial
variable (the hidden state). This multinomial assumption allows an efficient parameter estimation
algorithm to be derived (the Baum-Welch algorithm). However, it also severely limits the
representational capacity of HMMs. For example, to represent 30 bits of information about the
history of a time sequence, an HMM would need $2^{30}$ distinct states. On the other hand, an HMM
with a distributed state representation could achieve the same task with 30 binary units
(Williams and Hinton, 1991). This paper addresses the problem of deriving efficient learning
algorithms for hidden Markov models with distributed state representations.

The need for distributed state representations in HMMs can be motivated in two ways. First, such
representations allow the state space to be decomposed into features that naturally decouple the
dynamics of a single process generating the time series. Second, distributed state representations
simplify the task of modeling time series generated by the interaction of multiple independent
processes. For example, a speech signal generated by the superposition of multiple simultaneous
speakers can potentially be modeled with such an architecture.

Williams and Hinton (1991) first formulated the problem of learning in HMMs with distributed
state representation and proposed a solution based on deterministic Boltzmann learning. The
approach presented in this paper is similar to Williams and Hinton's in that it is also based on a
statistical mechanical formulation of hidden Markov models. However, our learning algorithm is
quite different in that it makes use of the special structure of HMMs with distributed state
representation, resulting in a more efficient learning procedure. Anticipating the results in
section 2, this learning algorithm both obviates the need for the two-phase procedure of Boltzmann
machines and has an exact M-step. A different approach comes from Saul and Jordan (1995), who
derived a set of rules for computing the gradients required for learning in HMMs with distributed
state spaces. However, their methods can only be applied to a limited class of architectures.


2 Factorial hidden Markov models

Hidden Markov models are a generalization of mixture models. At any time step, the probability
density over the observables defined by an HMM is a mixture of the densities defined by each state
in the underlying Markov model. Temporal dependencies are introduced by specifying that the
prior probability of the state at time $t$ depends on the state at time $t-1$ through a transition
matrix, $P$ (Figure 1a).

Another generalization of mixture models, the cooperative vector quantizer (CVQ; Hinton and
Zemel, 1994), provides a natural formalism for distributed state representations in HMMs. Whereas
in simple mixture models each data point must be accounted for by a single mixture component,
in CVQs each data point is accounted for by the combination of contributions from many mixture
components, one from each separate vector quantizer. The total probability density modeled by a
CVQ is also a mixture model; however, this mixture density is assumed to factorize into a product
of densities, each density associated with one of the vector quantizers. Thus, the CVQ is a mixture
model with distributed representations for the mixture components.

Factorial hidden Markov models¹ combine the state transition structure of HMMs with the
distributed representations of CVQs (Figure 1b). Each of the $d$ underlying Markov models has a
discrete state $s_i^t$ at time $t$ and transition probability matrix $P_i$. As in the CVQ, the
states are mutually exclusive within each vector quantizer and we assume real-valued outputs. The
sequence of observable output vectors is generated from a normal distribution with mean given by
the weighted combination of the states of the underlying Markov models:

$$ y^t \sim \mathcal{N}\Big(\sum_{i=1}^{d} W_i s_i^t,\; C\Big), $$

where $C$ is a common covariance matrix. The $k$-valued states $s_i$ are represented as discrete
column vectors with a 1 in one position and 0 everywhere else; the mean of the observable is
therefore a combination of columns from each of the $W_i$ matrices.
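The generative process described above can be sketched in a few lines of Python. This is a
minimal illustration, not the authors' code: the function name `sample_fhmm`, the nested-list
layout of the parameters, and the isotropic noise (a single standard deviation `sigma` standing
in for the covariance $C$) are all assumptions made for brevity.

```python
import random

def sample_fhmm(T, W, P, pi, sigma, rng):
    """Draw (states, outputs) from a factorial HMM with isotropic output noise.

    W[i][j]    : contribution of state j of chain i to the output mean
                 (a list of D floats, i.e. column j of W_i)
    P[i][j][l] : P(s_i^t = j | s_i^{t-1} = l)
    pi[i][j]   : prior over the first state of chain i
    sigma      : noise standard deviation; C = sigma^2 * I is an assumption here
    """
    d, D = len(W), len(W[0][0])
    states, outputs, prev = [], [], None
    for t in range(T):
        cur = []
        for i in range(d):
            k = len(pi[i])
            # Each chain evolves on its own, given only its own previous state.
            weights = pi[i] if prev is None else [P[i][j][prev[i]] for j in range(k)]
            cur.append(rng.choices(range(k), weights=weights)[0])
        # One-hot states select one column from each W_i; the mean is their sum.
        mean = [sum(W[i][cur[i]][n] for i in range(d)) for n in range(D)]
        outputs.append([rng.gauss(mu, sigma) for mu in mean])
        states.append(cur)
        prev = cur
    return states, outputs

# Hypothetical toy parameters: d = 2 chains, k = 2 states each, D = 2 outputs.
W = [[[0.0, 0.0], [1.0, 0.0]],
     [[0.0, 0.0], [0.0, 2.0]]]
P = [[[0.9, 0.2], [0.1, 0.8]]] * 2   # columns sum to one
pi = [[0.5, 0.5]] * 2
states, outputs = sample_fhmm(6, W, P, pi, 0.1, random.Random(0))
```

Each output is the sum of one column from every $W_i$ plus noise, which is the "combination of
columns" referred to above.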


Figure 1. (a) Hidden Markov model. (b) Factorial hidden Markov model.


We capture the above probability model by defining the energy of a sequence of $T$ states and
observations, $\{(s^t, y^t)\}_{t=1}^{T}$, which we abbreviate to $\{s,y\}$, as:

$$ \mathcal{H}(\{s,y\}) = \frac{1}{2}\sum_{t=1}^{T}\Big[y^t - \sum_{i=1}^{d} W_i s_i^t\Big]' C^{-1}\Big[y^t - \sum_{i=1}^{d} W_i s_i^t\Big] - \sum_{t=1}^{T}\sum_{i=1}^{d} s_i^{t\prime} A_i s_i^{t-1}, \tag{1} $$

where $[A_i]_{jl} = \log P(s_{ij}^t \mid s_{il}^{t-1})$ such that $\sum_{j=1}^{k} e^{[A_i]_{jl}} = 1$,
and $'$ denotes matrix transpose. Priors for the initial state, $s^1$, are introduced by setting
the second term in (1) to $-\sum_{i=1}^{d} s_i^{1\prime} \log \pi_i$. The probability model is
defined from this energy by the Boltzmann distribution

$$ P(\{s,y\}) = \frac{1}{Z}\exp\{-\mathcal{H}(\{s,y\})\}. \tag{2} $$

¹ We refer to HMMs with distributed state as factorial HMMs, as the features of the distributed
state factorize the total state representation.
Note that, as in the CVQ (Ghahramani, 1995), the unclamped partition function

$$ Z = \int d\{y\} \sum_{\{s\}} \exp\{-\mathcal{H}(\{s,y\})\} $$

evaluates to a constant, independent of the parameters. This can be shown by first integrating the
Gaussian variables, removing all dependency on $\{y\}$, and then summing over the states using the
constraint on $e^{[A_i]_{jl}}$.
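The two-step argument above can be written out as follows. This is a sketch rather than the
authors' derivation; $D$ (the dimension of $y^t$) and the shorthand $\pi_i(\cdot)$ and
$P_i(\cdot \mid \cdot)$ for the prior and transition probabilities selected by the one-hot state
vectors are notation introduced here.

```latex
\begin{aligned}
Z &= \sum_{\{s\}} \Big[\prod_{i=1}^{d} \pi_i(s_i^1) \prod_{t=2}^{T} P_i(s_i^t \mid s_i^{t-1})\Big]
     \prod_{t=1}^{T} \int dy^t \,
     \exp\Big\{-\tfrac12 \big[y^t - \textstyle\sum_i W_i s_i^t\big]' C^{-1}
                        \big[y^t - \textstyle\sum_i W_i s_i^t\big]\Big\} \\
  &= (2\pi)^{DT/2}\,|C|^{T/2} \sum_{\{s\}} \prod_{i=1}^{d} \pi_i(s_i^1)
     \prod_{t=2}^{T} P_i(s_i^t \mid s_i^{t-1}) \\
  &= (2\pi)^{DT/2}\,|C|^{T/2}.
\end{aligned}
```

The Gaussian integral does not depend on where the mean sits, so the $W_i$ drop out in the first
step; the remaining sum over state sequences equals one because the priors and each column of the
transition matrices are normalized, so the $A_i$ drop out in the second. The only factor left is
the Gaussian normalizer, which involves $C$.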


The EM algorithm for Factorial HMMs

As in HMMs, the parameters of a factorial HMM can be estimated via the EM (Baum-Welch)
algorithm. This procedure iterates between assuming the current parameters to compute
probabilities over the hidden states (E-step), and using these probabilities to maximize the
expected log likelihood of the parameters (M-step).

Using the likelihood (2), the expected log likelihood of the parameters is

$$ Q(\phi^{new} \mid \phi) = \big\langle -\mathcal{H}(\{s,y\}) - \log Z \big\rangle_c, \tag{3} $$

where $\phi = \{W_i, P_i, C\}_{i=1}^{d}$ denotes the current parameters, and
$\langle\cdot\rangle_c$ denotes expectation given the clamped observation sequence and $\phi$.
Given the observation sequence, the only random variables are the hidden states. Expanding
equation (3) and limiting the expectation to these random variables, we find that the statistics
that need to be computed for the E-step are $\langle s_i^t\rangle_c$,
$\langle s_i^t s_j^{t\prime}\rangle_c$, and $\langle s_i^t s_i^{t-1\prime}\rangle_c$. Note that in
standard HMM notation (Rabiner and Juang, 1986), $\langle s_i^t\rangle_c$ corresponds to
$\gamma_t$ and $\langle s_i^t s_i^{t-1\prime}\rangle_c$ corresponds to $\xi_t$, whereas
$\langle s_i^t s_j^{t\prime}\rangle_c$ has no analogue when there is only a single underlying
Markov model. The M-step uses these expectations to maximize $Q$ with respect to the parameters.
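To see where these three statistics come from, the clamped expectation of the energy (1) can be
expanded term by term. The following is a sketch that uses only the definitions above; writing
the quadratic forms as traces is a notational convenience introduced here.

```latex
\langle \mathcal{H} \rangle_c
  = \frac12 \sum_{t=1}^{T} \Big[
      y^{t\prime} C^{-1} y^t
      - 2 \sum_{i=1}^{d} y^{t\prime} C^{-1} W_i \langle s_i^t \rangle_c
      + \sum_{i=1}^{d} \sum_{j=1}^{d}
        \operatorname{tr}\big( W_i' C^{-1} W_j \langle s_j^t s_i^{t\prime} \rangle_c \big)
    \Big]
  - \sum_{t=1}^{T} \sum_{i=1}^{d}
    \operatorname{tr}\big( A_i \langle s_i^{t-1} s_i^{t\prime} \rangle_c \big)
```

The linear term needs only $\langle s_i^t\rangle_c$, the quadratic term needs
$\langle s_j^t s_i^{t\prime}\rangle_c$ for every pair of chains at the same time step, and the
transition term needs $\langle s_i^{t-1} s_i^{t\prime}\rangle_c$, which is the transpose of
$\langle s_i^t s_i^{t-1\prime}\rangle_c$.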

The constant partition function allowed us to drop the second term in (3). Therefore, unlike
the Boltzmann machine, the expected log likelihood does not depend on statistics collected in an
unclamped phase of learning, resulting in much faster learning than the traditional Boltzmann
machine (Neal, 1992).


M-step 


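Setting the derivatives of $Q$ with respect to the output weights to zero, we obtain a linear
system of equations for $W$:

$$ (W^{new})' = \Big[\sum_{N,t} \langle s s' \rangle_c\Big]^{\dagger} \Big[\sum_{N,t} \langle s \rangle_c\, y'\Big], $$

where $s$ and $W$ are the vector and matrix of concatenated $s_i$ and $W_i$, respectively,
$\sum_{N,t}$ denotes summation over a data set of $N$ sequences, and $\dagger$ is the
Moore-Penrose pseudo-inverse. To estimate the log transition probabilities we solve
$\partial Q / \partial [A_i]_{jl} = 0$ subject to the constraint
$\sum_{j} e^{[A_i]_{jl}} = 1$, obtaining

$$ [A_i]_{jl}^{new} = \log\left( \frac{\sum_{N,t} \langle s_{ij}^t s_{il}^{t-1} \rangle_c}{\sum_{N,t,j} \langle s_{ij}^t s_{il}^{t-1} \rangle_c} \right). \tag{4} $$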


The covariance matrix can be similarly estimated:

$$ C^{new} = \sum_{N,t} y y' - \sum_{N,t} y \langle s \rangle_c'\, \langle s s' \rangle_c^{\dagger}\, \langle s \rangle_c\, y'. $$

The M-step equations can therefore be solved analytically; furthermore, for a single underlying
Markov chain, they reduce to the traditional Baum-Welch re-estimation equations.
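As an illustration of how simple the closed-form M-step is, the transition update (4) for a single
chain can be sketched as below. The function name `update_log_transitions` and the input layout
are hypothetical; `xi[j][l]` is assumed to hold the pairwise statistic
$\langle s_{ij}^t s_{il}^{t-1}\rangle_c$ already summed over sequences and time steps, and every
entry is assumed to be strictly positive so that the logarithm is defined.

```python
import math

def update_log_transitions(xi):
    """Return A with A[j][l] = log P(state j at t | state l at t-1).

    xi[j][l] is the accumulated expected count of a transition from l to j.
    Each column l is normalized so that the exponentiated entries sum to one.
    """
    k = len(xi)
    col_totals = [sum(xi[j][l] for j in range(k)) for l in range(k)]
    return [[math.log(xi[j][l] / col_totals[l]) for l in range(k)]
            for j in range(k)]

# Hypothetical accumulated statistics for one chain with k = 2 states.
xi = [[8.0, 1.0],
      [2.0, 3.0]]
A = update_log_transitions(xi)
```

With a single chain this is the familiar Baum-Welch count-and-normalize rule, expressed in the
log domain.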
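E-step

Unfortunately, as in the simpler CVQ, the exact E-step for factorial HMMs is computationally
intractable. For example, the expectation of the $j$th unit in vector $i$ at time step $t$, given
$\{y\}$, is:

$$
\begin{aligned}
\langle s_{ij}^t \rangle_c &= P(s_{ij}^t = 1 \mid \{y\}, \phi) \\
&= \sum_{s_1^t}^{k} \cdots \sum_{s_{h \neq i}^t}^{k} \cdots \sum_{s_d^t}^{k} P(s_{ij}^t = 1, s_1^t, \ldots, s_h^t, \ldots, s_d^t \mid \{y\}, \phi).
\end{aligned}
$$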

Although the Markov property can be used to obtain a forward-backward-like factorization of this
expectation across time steps, the sum over all possible configurations of the other hidden units
within each time step is unavoidable. For a data set of $N$ sequences of length $T$, the full
E-step calculated through the forward-backward procedure has time complexity $O(NTk^{2d})$.
Although more careful bookkeeping can reduce the complexity to $O(NTdk^{d+1})$, the exponential
time cannot be avoided. This intractability of the exact E-step is due inherently to the
cooperative nature of the model: the setting of one vector only determines the mean of the
observable if all the other vectors are fixed.
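To give a feel for these growth rates, the short sketch below evaluates the two exact E-step
costs per observation (that is, with the common factor $NT$ and all constants dropped) for a few
small values of $d$ and $k$. The function name `exact_e_step_costs` is made up for this
illustration.

```python
def exact_e_step_costs(d, k):
    """Operation counts per observation, ignoring constant factors."""
    return {
        "forward_backward": k ** (2 * d),   # all pairs of the k^d joint states
        "bookkeeping": d * k ** (d + 1),    # the more careful variant
    }

for d, k in [(3, 2), (3, 3), (5, 2), (5, 3)]:
    print(d, k, exact_e_step_costs(d, k))
```

Even at $d = 5$, $k = 3$ the bookkeeping variant saves roughly a factor of 16, yet both counts
still grow exponentially in $d$.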

Rather than summing over all possible hidden state patterns to compute the exact expectations,
a natural approach is to approximate them through a Monte Carlo method such as Gibbs sampling.
The procedure starts with a clamped observable sequence $\{y\}$ and a random setting of the hidden
states $\{s_i^t\}$. At each time step, each state vector is updated stochastically according to
its probability distribution conditioned on the setting of all the other state vectors:
$s_i^t \sim P(s_i^t \mid \{y\}, \{s_j^\tau : j \neq i \text{ or } \tau \neq t\}, \phi)$. These
conditional distributions are straightforward to compute and a full pass of Gibbs sampling
requires $O(NTkd)$ operations. The first and second-order statistics needed to estimate
$\langle s_i^t\rangle_c$, $\langle s_i^t s_j^{t\prime}\rangle_c$ and
$\langle s_i^t s_i^{t-1\prime}\rangle_c$ are collected using the $s_{ij}^t$'s visited and the
probabilities estimated during this sampling process.
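One pass of this sampler might look like the sketch below. It is an illustration under
simplifying assumptions, not the implementation used in the experiments: the output is scalar,
the covariance is a single variance `var`, parameters are nested lists, and the name
`gibbs_sweep` is invented here. Collecting the statistics from the visited states is left out.

```python
import math
import random

def gibbs_sweep(y, states, W, P, pi, var, rng):
    """Resample every states[i][t] in place, conditioned on all other states.

    y[t]        : scalar observation
    W[i][j]     : contribution of state j of chain i to the output mean
    P[i][j][l]  : P(s_i^t = j | s_i^{t-1} = l);  pi[i][j] : initial-state prior
    """
    d, T = len(states), len(y)
    for t in range(T):
        for i in range(d):
            k = len(W[i])
            # Mean contributed by the other chains, which are held fixed.
            rest = sum(W[h][states[h][t]] for h in range(d) if h != i)
            logw = []
            for j in range(k):
                resid = y[t] - rest - W[i][j]
                lw = -0.5 * resid * resid / var
                # Link to the previous time step (or the prior at the start).
                lw += math.log(pi[i][j] if t == 0 else P[i][j][states[i][t - 1]])
                # Link to the next time step, if there is one.
                if t + 1 < T:
                    lw += math.log(P[i][states[i][t + 1]][j])
                logw.append(lw)
            # Normalize in the log domain to avoid underflow, then sample.
            top = max(logw)
            w = [math.exp(v - top) for v in logw]
            u, acc, pick = rng.random() * sum(w), 0.0, k - 1
            for j in range(k):
                acc += w[j]
                if u < acc:
                    pick = j
                    break
            states[i][t] = pick
    return states

# Hypothetical toy problem: two binary chains, nearly noiseless scalar output.
rng = random.Random(0)
W = [[0.0, 1.0], [0.0, 3.0]]
P = [[[0.9, 0.1], [0.1, 0.9]]] * 2
pi = [[0.5, 0.5]] * 2
true = [[0, 1, 1, 0], [1, 1, 0, 0]]
y = [W[0][true[0][t]] + W[1][true[1][t]] for t in range(4)]
states = [[rng.randrange(2) for _ in range(4)] for _ in range(2)]
for _ in range(5):
    gibbs_sweep(y, states, W, P, pi, 0.01, rng)
```

Each update touches only the $k$ candidate values of one state vector, which is where the
$O(NTkd)$ cost of a full pass comes from.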


Mean field approximation

A different approach to computing the expectations in an intractable system is given by mean field
theory. A mean field approximation for factorial HMMs can be obtained by defining the energy
function

$$ \tilde{\mathcal{H}}(\{s,y\}) = \frac{1}{2}\sum_{t} [y^t - \mu^t]' C^{-1} [y^t - \mu^t] - \sum_{t,i} s_i^{t\prime} \log m_i^t, $$

which results in a completely factorized approximation to probability density (2):

$$ \tilde{P}(\{s,y\}) \propto \prod_{t} \exp\Big\{-\frac{1}{2} [y^t - \mu^t]' C^{-1} [y^t - \mu^t]\Big\} \prod_{t,i,j} (m_{ij}^t)^{s_{ij}^t}. \tag{5} $$


In this approximation, the observables are independently Gaussian distributed with mean $\mu^t$
and each hidden state vector is multinomially distributed with mean $m_i^t$. This approximation is
made as tight as possible by choosing the mean field parameters $\mu^t$ and $m_i^t$ that minimize
the Kullback-Leibler divergence

$$ \mathrm{KL}(\tilde{P} \,\|\, P) = \langle \log \tilde{P} \rangle_{\tilde{P}} - \langle \log P \rangle_{\tilde{P}}, $$


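where $\langle\cdot\rangle_{\tilde{P}}$ denotes expectation over the mean field distribution (5).
With the observables clamped, $\mu^t$ can be set equal to the observable $y^t$. Minimizing
$\mathrm{KL}(\tilde{P} \,\|\, P)$ with respect to the mean field parameters for the states results
in a fixed-point equation which can be iterated until convergence:

$$ m_i^{t,new} = \sigma\Big\{ W_i' C^{-1} [y^t - \hat{y}^t] + W_i' C^{-1} W_i m_i^t - \frac{1}{2}\operatorname{diag}\{W_i' C^{-1} W_i\} - 1 + A_i m_i^{t-1} + A_i' m_i^{t+1} \Big\}, \tag{6} $$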

where $\hat{y}^t = \sum_i W_i m_i^t$ and $\sigma\{\cdot\}$ is the softmax exponential, normalized
over each hidden state vector. The first term is the projection of the error in the observable
onto the weights of state vector $i$: the more a hidden unit can reduce this error, the larger its
mean field parameter. The next three terms arise from the fact that
$\langle s_{ij}^2 \rangle_{\tilde{P}}$ is equal to $m_{ij}$ and not $m_{ij}^2$. The last two terms
introduce dependencies forward and backward in time. Each state vector is asynchronously updated
using (6), at a time cost of $O(NTkd)$ per iteration. Convergence is diagnosed by monitoring the
KL divergence in the mean field distribution between successive time steps; in practice
convergence is very rapid (about 2 to 10 iterations of (6)).
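The fixed-point iteration (6) can be sketched as follows. As before, this is a simplified
illustration rather than the authors' implementation: the output is scalar, the covariance is a
single variance `var`, the names `softmax` and `mean_field_pass` are invented here, and the prior
$\log \pi_i$ is used in place of the backward term at the first time step. The constant $-1$ in
(6) is omitted because it cancels under the softmax.

```python
import math

def softmax(v):
    top = max(v)
    e = [math.exp(x - top) for x in v]
    z = sum(e)
    return [x / z for x in e]

def mean_field_pass(y, m, W, logP, logpi, var):
    """Asynchronously update every mean field vector m[i][t] in place.

    m[i][t]       : list of k probabilities for chain i at time t
    logP[i][j][l] : log P(s_i^t = j | s_i^{t-1} = l), i.e. the matrix A_i
    """
    d, T = len(m), len(y)
    for t in range(T):
        for i in range(d):
            k = len(W[i])
            y_hat = sum(W[h][j] * m[h][t][j]
                        for h in range(d) for j in range(len(W[h])))
            own = sum(W[i][l] * m[i][t][l] for l in range(k))
            field = []
            for j in range(k):
                f = W[i][j] * (y[t] - y_hat) / var   # W_i' C^-1 [y - y_hat]
                f += W[i][j] * own / var             # W_i' C^-1 W_i m_i
                f -= 0.5 * W[i][j] ** 2 / var        # -1/2 diag{W_i' C^-1 W_i}
                if t == 0:
                    f += logpi[i][j]
                else:                                # A_i m^{t-1}
                    f += sum(logP[i][j][l] * m[i][t - 1][l] for l in range(k))
                if t + 1 < T:                        # A_i' m^{t+1}
                    f += sum(logP[i][l][j] * m[i][t + 1][l] for l in range(k))
                field.append(f)
            m[i][t] = softmax(field)
    return m

# Hypothetical toy problem: two binary chains, nearly noiseless scalar output.
W = [[0.0, 1.0], [0.0, 3.0]]
logP = [[[math.log(0.9), math.log(0.1)], [math.log(0.1), math.log(0.9)]]] * 2
logpi = [[math.log(0.5)] * 2] * 2
true = [[0, 1, 1, 0], [1, 1, 0, 0]]
y = [W[0][true[0][t]] + W[1][true[1][t]] for t in range(4)]
m = [[[0.5, 0.5] for _ in range(4)] for _ in range(2)]
for _ in range(3):
    mean_field_pass(y, m, W, logP, logpi, 0.01)
decoded = [[max(range(2), key=lambda j: m[i][t][j]) for t in range(4)]
           for i in range(2)]
```

Unlike the Gibbs sampler, the update is deterministic: each state vector is replaced by a
probability vector rather than a sampled value.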


3 Empirical Results

We compared three EM algorithms for learning in factorial HMMs, using Gibbs sampling, the mean
field approximation, and the exact (exponential) E-step respectively, on the basis of performance
and speed on randomly generated problems. Problems were generated from a factorial HMM structure,
the parameters of which were sampled from a uniform [0, 1] distribution and appropriately
normalized to satisfy the sum-to-one constraints of the transition matrices and priors. Also
included in the comparison was a traditional HMM with as many states ($k^d$) as the factorial HMM.
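A problem generator along these lines could be sketched as below. The text does not say how the
normalization was carried out, so dividing each prior and each transition column by its sum is an
assumption, as are the helper names `normalize` and `random_chain`.

```python
import random

def normalize(v):
    total = sum(v)
    return [x / total for x in v]

def random_chain(k, rng):
    """Sample a prior and a transition matrix for one k-state chain."""
    prior = normalize([rng.random() for _ in range(k)])
    # cols[l] is the distribution over next states given current state l.
    cols = [normalize([rng.random() for _ in range(k)]) for _ in range(k)]
    P = [[cols[l][j] for l in range(k)] for j in range(k)]  # P[j][l] = P(j | l)
    return prior, P

rng = random.Random(0)
chains = [random_chain(3, rng) for _ in range(5)]   # d = 5 chains, k = 3 states
```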

Table 1 summarizes the results. Even for moderately large state spaces ($d \geq 3$ and
$k \geq 3$) the standard HMM with $k^d$ states suffers from severe overfitting. Furthermore, both
the standard HMM and the exact E-step factorial HMM are extremely slow on the larger problems. The
Gibbs sampling and mean field approximations offer roughly comparable performance at a great
increase in speed.

4 Discussion

The basic contribution of this paper is a learning algorithm for hidden Markov models with
distributed state representations. The standard Baum-Welch procedure is intractable for such
architectures as the size of the state space generated from the cross product of $d$ $k$-valued
features is $O(k^d)$, and the time complexity of Baum-Welch is quadratic in this size. More
importantly, unless special constraints are applied to this cross-product HMM architecture, the
number of parameters also grows as $O(k^{2d})$, which can result in severe overfitting.
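The gap in model size can be made concrete with a small calculation. The sketch below counts only
transition-matrix entries (output parameters are ignored), and the function name
`transition_sizes` is introduced purely for this illustration.

```python
def transition_sizes(d, k):
    """Transition entries for a cross-product HMM versus a factorial HMM."""
    joint_states = k ** d
    return {
        "hmm_states": joint_states,
        "hmm_transition_entries": joint_states ** 2,   # one k^d by k^d matrix
        "factorial_transition_entries": d * k ** 2,    # d matrices of size k by k
    }

print(transition_sizes(5, 3))
print(transition_sizes(30, 2))
```

For the largest problem in Table 1 ($d = 5$, $k = 3$) the cross-product HMM has 243 states and
59049 transition entries, against 45 for the factorial model.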

The architecture for factorial HMMs presented in this paper did not include any coupling between
the underlying Markov chains. It is possible to extend the algorithm presented to architectures
which incorporate such couplings. However, these couplings must be introduced with caution as they
may result either in an exponential growth in parameters or in a loss of the constant partition
function property.

The learning algorithm derived in this paper assumed real-valued observables. The algorithm can
also be derived for HMMs with discrete observables, an architecture closely related to sigmoid
belief networks (Neal, 1992). However, the nonlinearities induced by discrete observables make
both the E-step and M-step of the algorithm more difficult.
Table 1: Comparison of factorial HMM on four problems of varying size

d  k  Alg    #  Train             Test              Cycles         Time/Cycle
3  2  HMM    5  649 ± 8           358 ± 81          33 ± 19        1.1 s
      Exact     877 ± 0           768 ± 0           22 ± 6         3.0 s
      Gibbs     710 ± 152         627 ± 129         28 ± 11        6.0 s
      MF        755 ± 168         670 ± 137         32 ± 22        1.2 s
3  3  HMM    5  670 ± 26          -782 ± 128        23 ± 10        3.6 s
      Exact     568 ± 164         276 ± 62          35 ± 12        5.2 s
      Gibbs     564 ± 160         305 ± 51          45 ± 16        9.2 s
      MF        495 ± 83          326 ± 62          38 ± 22        1.6 s
5  2  HMM    5  588 ± 37          -2634 ± 566       18 ± 1         5.2 s
      Exact     223 ± 76          159 ± 80          31 ± 17        6.9 s
      Gibbs     123 ± 103         73 ± 95           40 ± 5         12.7 s
      MF        292 ± 101         237 ± 103         54 ± 29        2.2 s
5  3  HMM    3  1671, 1678, 1690  -∞, -∞, -∞        14, 14, 12     90.0 s
      Exact     -55, -354, -295   -123, -378, -402  90, 100, 100   51.0 s
      Gibbs     -123, -160, -194  -202, -237, -307  100, 73, 100   14.2 s
      MF        -287, -286, -296  -364, -370, -365  100, 100, 100  4.7 s

Table 1. Data was generated from a factorial HMM with $d$ underlying Markov
models of $k$ states each. The training set was 10 sequences of length 20 where the
observable was a 4-dimensional vector; the test set was 20 such sequences. HMM
indicates a hidden Markov model with $k^d$ states; the other algorithms are factorial
HMMs with $d$ underlying $k$-state models. Gibbs sampling used 10 samples of each
state. The algorithms were run until convergence, as monitored by relative change
in the likelihood, or a maximum of 100 cycles. The # column indicates number of
runs. The Train and Test columns show the log likelihood ± one standard deviation
on the two data sets. The last column indicates approximate time per cycle on a
Silicon Graphics R4400 processor running Matlab.
In conclusion, we have presented Gibbs sampling and mean field learning algorithms for factorial
hidden Markov models. Such models incorporate the time series modeling capabilities of hidden
Markov models and the advantages of distributed representations for the state space. Future work
will concentrate on a more efficient mean field approximation in which the forward-backward
algorithm is used to compute the E-step exactly within each Markov chain, and mean field theory is
used to handle interactions between chains (Saul and Jordan, 1996).